{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "import sys\n",
    "import os\n",
    "if not any(path.endswith('textbook') for path in sys.path):\n",
    "    sys.path.append(os.path.abspath('../../..'))\n",
    "from textbook_utils import *"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(sec:pa_collocated)=\n",
    "# Finding Collocated Sensors"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our analysis begins by finding collocated pairs of AQS and PurpleAir sensors---sensors that are placed essentially next to each other.\n",
    "This step is important because it lets us reduce the effects of other variables that might cause differences in sensor readings.\n",
    "Consider what would happen if we compared an AQS sensor placed in a park with a PurpleAir sensor placed along a busy freeway. \n",
    "The two sensors would have different readings, in part because the sensors are exposed to different environments.\n",
    "Ensuring that sensors are truly collocated lets us claim the differences in sensor readings are due to how the sensors are built and to small, localized air fluctuations, rather than\n",
    "other potential confounding variables."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Barkjohn's analysis conducted by the EPA group found pairs of AQS and PurpleAir sensors that\n",
    "are installed within 50 meters of each other.\n",
    "The group contacted each AQS site to see whether the PurpleAir sensor was also maintained there.\n",
    "This extra effort gave them confidence that their sensor pairs were truly collocated.\n",
    "\n",
    "In this section, we explore and clean location data from AQS and PurpleAir.\n",
    "Then we perform a join of sorts to construct a list of potentially collocated sensors.\n",
    "We won't contact AQS sites ourselves;\n",
    "instead, we proceed in later sections with Barkjohn's list of confirmed\n",
    "collocated sensors."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We downloaded a list of AQS and PurpleAir sensors and saved the data in the files\n",
    " _data/list_of_aqs_sites.csv_\n",
    "and _data/list_of_purpleair_sensors.json_.\n",
    "Let's begin by reading these files into `pandas` dataframes.\n",
    "First, we check file sizes to see whether they are reasonable to load into memory:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-rw-r--r--  1 sam  staff   4.8M Oct 27 16:54 data/list_of_aqs_sites.csv\n",
      "-rw-r--r--  1 sam  staff   3.8M Oct 22 16:10 data/list_of_purpleair_sensors.json\n"
     ]
    }
   ],
   "source": [
    "!ls -lLh data/list_of*"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Both files are relatively small. Let's start with the list of AQS sites."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Wrangling the List of AQS Sites"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We have filtered the [AQS map of sites](https://www.epa.gov/outdoor-air-quality-data/interactive-map-air-quality-monitors) to show only the AQS sites that measure PM2.5, and then downloaded the list of sites as a CSV file using the map's web app. Now we can load it into a `pandas` dataframe.:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(1333, 28)"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "aqs_sites_full = pd.read_csv('data/list_of_aqs_sites.csv')\n",
    "aqs_sites_full.shape"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are 28 columns in the table. Let's check the column names:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['AQS_Site_ID', 'POC', 'State', 'City', 'CBSA', 'Local_Site_Name',\n",
       "       'Address', 'Datum', 'Latitude', 'Longitude', 'LatLon_Accuracy_meters',\n",
       "       'Elevation_meters_MSL', 'Monitor_Start_Date', 'Last_Sample_Date',\n",
       "       'Active', 'Measurement_Scale', 'Measurement_Scale_Definition',\n",
       "       'Sample_Duration', 'Sample_Collection_Frequency',\n",
       "       'Sample_Collection_Method', 'Sample_Analysis_Method',\n",
       "       'Method_Reference_ID', 'FRMFEM', 'Monitor_Type', 'Reporting_Agency',\n",
       "       'Parameter_Name', 'Annual_URLs', 'Daily_URLs'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "aqs_sites_full.columns"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To find out which columns are most useful for us, we reference the [data dictionary](https://aqs.epa.gov/aqsweb/airdata/FileFormats.html#_monitor_description_file) that the AQS provides on its website.\n",
    "There we confirm that the data table contains information about the AQS sites.\n",
    "So we might expect the granularity corresponds to an AQS site; meaning each row represents a single site\n",
    "and the column labeled `AQS_Site_ID` is the primary key. We can confirm this with a count of records for each ID:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "06-071-0306    4\n",
       "19-163-0015    4\n",
       "39-061-0014    4\n",
       "              ..\n",
       "46-103-0020    1\n",
       "19-177-0006    1\n",
       "51-680-0015    1\n",
       "Name: AQS_Site_ID, Length: 921, dtype: int64"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "aqs_sites_full['AQS_Site_ID'].value_counts()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It looks like some sites appear multiple times in this dataframe.\n",
    "Unfortunately, this means that the granularity is finer than the individual site level.\n",
    "To figure out why sites are duplicated, let's take a closer look at the rows for one duplicated site:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "dup_site = aqs_sites_full.query(\"AQS_Site_ID == '19-163-0015'\")"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We select a few columns to examine based on their names---those that sound like they might shed some light on the reason for duplicates:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>POC</th>\n",
       "      <th>Monitor_Start_Date</th>\n",
       "      <th>Last_Sample_Date</th>\n",
       "      <th>Sample_Collection_Method</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>458</th>\n",
       "      <td>1</td>\n",
       "      <td>1/27/1999</td>\n",
       "      <td>8/31/2021</td>\n",
       "      <td>R &amp; P Model 2025 PM-2.5 Sequential Air Sampler...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>459</th>\n",
       "      <td>2</td>\n",
       "      <td>2/9/2013</td>\n",
       "      <td>8/26/2021</td>\n",
       "      <td>R &amp; P Model 2025 PM-2.5 Sequential Air Sampler...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>460</th>\n",
       "      <td>3</td>\n",
       "      <td>1/1/2019</td>\n",
       "      <td>9/30/2021</td>\n",
       "      <td>Teledyne T640 at 5.0 LPM</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>461</th>\n",
       "      <td>4</td>\n",
       "      <td>1/1/2019</td>\n",
       "      <td>9/30/2021</td>\n",
       "      <td>Teledyne T640 at 5.0 LPM</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     POC Monitor_Start_Date Last_Sample_Date  \\\n",
       "458    1          1/27/1999        8/31/2021   \n",
       "459    2           2/9/2013        8/26/2021   \n",
       "460    3           1/1/2019        9/30/2021   \n",
       "461    4           1/1/2019        9/30/2021   \n",
       "\n",
       "                              Sample_Collection_Method  \n",
       "458  R & P Model 2025 PM-2.5 Sequential Air Sampler...  \n",
       "459  R & P Model 2025 PM-2.5 Sequential Air Sampler...  \n",
       "460                           Teledyne T640 at 5.0 LPM  \n",
       "461                           Teledyne T640 at 5.0 LPM  "
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "some_cols = ['POC', 'Monitor_Start_Date',\n",
    "             'Last_Sample_Date', 'Sample_Collection_Method']\n",
    "dup_site[some_cols]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `POC` column looks to be useful for distinguishing between rows in the table. The data dictionary  states this about the column:\n",
    "\n",
    "> This is the “Parameter Occurrence Code” used to distinguish different instruments that measure the same parameter at the same site.\n",
    "\n",
    "So, the site `19-163-0015` has four instruments that all\n",
    "measure PM2.5. The granularity of the dataframe is at the level of a single instrument.\n",
    "\n",
    "Since our aim is to match AQS and PurpleAir sensors,\n",
    "we can adjust the granularity by selecting one instrument from each AQS site.\n",
    "To do this, we group rows according to site ID, then take the first row in each group:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def rollup_dup_sites(df):\n",
    "    return (\n",
    "        df.groupby('AQS_Site_ID')\n",
    "        .first()\n",
    "        .reset_index()\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(921, 28)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "aqs_sites = (aqs_sites_full\n",
    "             .pipe(rollup_dup_sites))\n",
    "aqs_sites.shape"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the number of rows matches the number of unique IDs."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To match AQS sites with PurpleAir sensors, we only need the site ID, latitude, and longitude.\n",
    "So we further adjust the structure and keep only those columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cols_aqs(df):\n",
    "    subset = df[['AQS_Site_ID', 'Latitude', 'Longitude']]\n",
    "    subset.columns = ['site_id', 'lat', 'lon']\n",
    "    return subset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "aqs_sites = (aqs_sites_full\n",
    "             .pipe(rollup_dup_sites)\n",
    "             .pipe(cols_aqs))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now the `aqs_sites` dataframe is ready, and we move to the PurpleAir sites."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Wrangling the List of PurpleAir Sites"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unlike the AQS sites, the file containing PurpleAir sensor data comes in a\n",
    "JSON format. We address this format in more detail in {numref}`Chapter %s <ch:web>`. For now, we use shell tools (see {numref}`Chapter %s <ch:files>`) to peak at the  file contents:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{\"version\":\"7.0.30\",\n",
      "\"fields\":\n",
      "[\"ID\",\"pm\",\"pm_cf_1\",\"pm_atm\",\"age\",\"pm_0\",\"pm_1\",\"pm_2\",\"pm\n",
      "\"data\":[\n",
      "[20,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,97,0.0,0.0,0.0\n",
      "[47,null,null,null,4951,null,null,null,null,null,null,null,9\n",
      "[53,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,1.2,5.2,6.0,97,0.0,0.5,702\n",
      "[74,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,97,0.0,0.0,0.0\n",
      "[77,9.8,9.8,9.8,1,9.8,10.7,11.0,11.2,13.8,15.1,15.5,97,9.7,9\n",
      "[81,6.5,6.5,6.5,0,6.5,6.1,6.1,6.6,8.1,8.3,9.7,97,5.9,6.8,405\n"
     ]
    }
   ],
   "source": [
    "!head data/list_of_purpleair_sensors.json | cut -c 1-60"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "From the first few lines of the file, we can guess that the data are\n",
    "stored in the `\"data\"` key and the column labels in the `\"fields\"` key.\n",
    "We can use Python's `json` library to read in the file as a Python `dict`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['version', 'fields', 'data', 'count']"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import json\n",
    "\n",
    "with open('data/list_of_purpleair_sensors.json') as f:\n",
    "    pa_json = json.load(f)\n",
    "\n",
    "list(pa_json.keys())"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can create a dataframe from the values in `data` and label the columns with the content of `fields`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ID</th>\n",
       "      <th>pm</th>\n",
       "      <th>pm_cf_1</th>\n",
       "      <th>pm_atm</th>\n",
       "      <th>...</th>\n",
       "      <th>Voc</th>\n",
       "      <th>Ozone1</th>\n",
       "      <th>Adc</th>\n",
       "      <th>CH</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>20</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.01</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>47</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.72</td>\n",
       "      <td>0.72</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>53</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.00</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>74</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.05</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>77</td>\n",
       "      <td>9.8</td>\n",
       "      <td>9.8</td>\n",
       "      <td>9.8</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>0.01</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 36 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "   ID   pm  pm_cf_1  pm_atm  ...  Voc  Ozone1   Adc  CH\n",
       "0  20  0.0      0.0     0.0  ...  NaN     NaN  0.01   1\n",
       "1  47  NaN      NaN     NaN  ...  NaN    0.72  0.72   0\n",
       "2  53  0.0      0.0     0.0  ...  NaN     NaN  0.00   1\n",
       "3  74  0.0      0.0     0.0  ...  NaN     NaN  0.05   1\n",
       "4  77  9.8      9.8     9.8  ...  NaN     NaN  0.01   1\n",
       "\n",
       "[5 rows x 36 columns]"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pa_sites_full = pd.DataFrame(pa_json['data'], columns=pa_json['fields'])\n",
    "pa_sites_full.head()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Like the AQS data, there are many more columns in this dataframe  than we need:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['ID', 'pm', 'pm_cf_1', 'pm_atm', 'age', 'pm_0', 'pm_1', 'pm_2', 'pm_3',\n",
       "       'pm_4', 'pm_5', 'pm_6', 'conf', 'pm1', 'pm_10', 'p1', 'p2', 'p3', 'p4',\n",
       "       'p5', 'p6', 'Humidity', 'Temperature', 'Pressure', 'Elevation', 'Type',\n",
       "       'Label', 'Lat', 'Lon', 'Icon', 'isOwner', 'Flags', 'Voc', 'Ozone1',\n",
       "       'Adc', 'CH'],\n",
       "      dtype='object')"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pa_sites_full.columns"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, we can guess that the columns we're most interested in are the\n",
    "sensor IDs (`ID`), sensor labels (`Label`), latitude (`Lat`), and longitude (`Lon`).\n",
    "But we did consult the data dictionary on the PurpleAir website to double-check."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's check the `ID` column for duplicates, as we did for the AQS data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "85829     1\n",
       "117575    1\n",
       "118195    1\n",
       "Name: ID, dtype: int64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pa_sites_full['ID'].value_counts()[:3]"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since the `value_counts()` method lists the counts in descending order, we can see\n",
    "that every ID was included only once. So we have verified the granularity is at the individual sensor level.\n",
    "Next, we keep only the columns needed to match sensor locations from the two sources:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cols_pa(df):\n",
    "    subset = df[['ID', 'Label', 'Lat', 'Lon']]\n",
    "    subset.columns = ['id', 'label', 'lat', 'lon']\n",
    "    return subset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(23138, 4)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pa_sites = (pa_sites_full\n",
    "            .pipe(cols_pa))\n",
    "pa_sites.shape"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice there are tens of thousands more PurpleAir sensors than AQS sensors.\n",
    "Our next task is to find the PurpleAir sensor close to each AQS sensor."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Matching AQS and PurpleAir Sensors"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Our goal is to match sensors in the two dataframes by finding a PurpleAir sensor near each AQS instrument. \n",
    "We consider near to mean within 50 meters.\n",
    "This kind of matching is a bit more challenging than the joins we've seen thus far.\n",
    "For instance, the naive approach to use the `merge` method of `pandas` fails us:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>site_id</th>\n",
       "      <th>lat</th>\n",
       "      <th>lon</th>\n",
       "      <th>id</th>\n",
       "      <th>label</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>06-111-1004</td>\n",
       "      <td>34.45</td>\n",
       "      <td>-119.23</td>\n",
       "      <td>48393</td>\n",
       "      <td>VCAPCD OJ</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       site_id    lat     lon     id      label\n",
       "0  06-111-1004  34.45 -119.23  48393  VCAPCD OJ"
      ]
     },
     "execution_count": 48,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "aqs_sites.merge(pa_sites, left_on=['lat', 'lon'], right_on=['lat', 'lon'])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We cannot simply match instruments with the exact same latitude and longitude;\n",
    "we need to find the PurpleAir sites that are close enough to the AQS instrument. \n",
    "\n",
    "To figure out how far apart two locations are, we use a basic approximation: `111,111` meters in the\n",
    "north–south direction roughly equals one degree of latitude, and `111,111 * cos(latitude)` in the east–west direction\n",
    "corresponds to one degree of longitude.[^latlon]\n",
    "So we can find the latitude and longitude ranges that correspond to 25 meters\n",
    "in each direction (to make a 50-by-50 meter rectangle around each point):\n",
    "\n",
    "[^latlon]: This estimation works by assuming that the Earth is perfectly spherical.\n",
    "Then, one degree of latitude is the\n",
    "radius of the Earth in meters.\n",
    "Plugging in the average radius of the Earth gives 111,111 meters per degree of latitude.\n",
    "Longitude is the same, but the radius of each \"ring\" around the Earth decreases\n",
    "as we get closer to the poles, so we adjust by a factor of $ \\cos(\\text{lat}) $ .\n",
    "It turns out that the Earth isn't perfectly spherical, so these estimations\n",
    "can't be used for precise calculations, like landing a rocket.\n",
    "But for our purposes, they do just fine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.000225000225000225"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "magic_meters_per_lat = 111_111\n",
    "offset_in_m = 25\n",
    "offset_in_lat = offset_in_m / magic_meters_per_lat\n",
    "offset_in_lat"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To simplify even more, we use the median latitude for the AQS sites:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0.000291515219937587"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "median_latitude = aqs_sites['lat'].median()\n",
    "magic_meters_per_lon = 111_111 * np.cos(np.radians(median_latitude))\n",
    "offset_in_lon = offset_in_m / magic_meters_per_lon\n",
    "offset_in_lon"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can match coordinates to within the `offset_in_lat` and `offset_in_lon`.\n",
    "Doing this in SQL is much easier than in `pandas`, so we\n",
    "push the tables into a temporary SQLite database, then run a query to read\n",
    "the tables back into a dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sqlalchemy\n",
    "\n",
    "db = sqlalchemy.create_engine('sqlite://')\n",
    "\n",
    "aqs_sites.to_sql(name='aqs', con=db, index=False)\n",
    "pa_sites.to_sql(name='pa', con=db, index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 96,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>aqs_id</th>\n",
       "      <th>pa_id</th>\n",
       "      <th>pa_label</th>\n",
       "      <th>aqs_lat</th>\n",
       "      <th>aqs_lon</th>\n",
       "      <th>pa_lat</th>\n",
       "      <th>pa_lon</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>06-019-0011</td>\n",
       "      <td>6568</td>\n",
       "      <td>IMPROVE_FRES2</td>\n",
       "      <td>36.79</td>\n",
       "      <td>-119.77</td>\n",
       "      <td>36.79</td>\n",
       "      <td>-119.77</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>06-019-0011</td>\n",
       "      <td>13485</td>\n",
       "      <td>AMTS_Fresno</td>\n",
       "      <td>36.79</td>\n",
       "      <td>-119.77</td>\n",
       "      <td>36.79</td>\n",
       "      <td>-119.77</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>06-019-0011</td>\n",
       "      <td>44427</td>\n",
       "      <td>Fresno CARB CCAC</td>\n",
       "      <td>36.79</td>\n",
       "      <td>-119.77</td>\n",
       "      <td>36.79</td>\n",
       "      <td>-119.77</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>146</th>\n",
       "      <td>53-061-1007</td>\n",
       "      <td>3659</td>\n",
       "      <td>Marysville 7th</td>\n",
       "      <td>48.05</td>\n",
       "      <td>-122.17</td>\n",
       "      <td>48.05</td>\n",
       "      <td>-122.17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>147</th>\n",
       "      <td>53-063-0021</td>\n",
       "      <td>54603</td>\n",
       "      <td>Augusta 1 SRCAA</td>\n",
       "      <td>47.67</td>\n",
       "      <td>-117.36</td>\n",
       "      <td>47.67</td>\n",
       "      <td>-117.36</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>148</th>\n",
       "      <td>56-021-0100</td>\n",
       "      <td>50045</td>\n",
       "      <td>WDEQ-AQD Cheyenne NCore</td>\n",
       "      <td>41.18</td>\n",
       "      <td>-104.78</td>\n",
       "      <td>41.18</td>\n",
       "      <td>-104.78</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>149 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "          aqs_id  pa_id                 pa_label  aqs_lat  aqs_lon  pa_lat  \\\n",
       "0    06-019-0011   6568            IMPROVE_FRES2    36.79  -119.77   36.79   \n",
       "1    06-019-0011  13485              AMTS_Fresno    36.79  -119.77   36.79   \n",
       "2    06-019-0011  44427         Fresno CARB CCAC    36.79  -119.77   36.79   \n",
       "..           ...    ...                      ...      ...      ...     ...   \n",
       "146  53-061-1007   3659           Marysville 7th    48.05  -122.17   48.05   \n",
       "147  53-063-0021  54603          Augusta 1 SRCAA    47.67  -117.36   47.67   \n",
       "148  56-021-0100  50045  WDEQ-AQD Cheyenne NCore    41.18  -104.78   41.18   \n",
       "\n",
       "     pa_lon  \n",
       "0   -119.77  \n",
       "1   -119.77  \n",
       "2   -119.77  \n",
       "..      ...  \n",
       "146 -122.17  \n",
       "147 -117.36  \n",
       "148 -104.78  \n",
       "\n",
       "[149 rows x 7 columns]"
      ]
     },
     "execution_count": 96,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "query = f'''\n",
    "SELECT\n",
    "  aqs.site_id AS aqs_id,\n",
    "  pa.id AS pa_id,\n",
    "  pa.label AS pa_label,\n",
    "  aqs.lat AS aqs_lat,\n",
    "  aqs.lon AS aqs_lon,\n",
    "  pa.lat AS pa_lat,\n",
    "  pa.lon AS pa_lon\n",
    "FROM aqs JOIN pa\n",
    "  ON  pa.lat - {offset_in_lat} <= aqs.lat\n",
    "  AND                             aqs.lat <= pa.lat + {offset_in_lat}\n",
    "  AND pa.lon - {offset_in_lon} <= aqs.lon\n",
    "  AND                             aqs.lon <= pa.lon + {offset_in_lon}\n",
    "'''\n",
    "matched = pd.read_sql(query, db)\n",
    "matched"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We've achieved our goal---we matched 149 AQS sites with PurpleAir sensors.\n",
    "Our wrangling of the locations is complete, and we turn to the task of \n",
    "wrangling and cleaning the sensor measurements. We start with the measurements taken from an AQS site."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
